Corelab Seminar
2020-2021
Orestis Papadigenopoulos
Contextual Blocking Bandits
Abstract.
Only recently have researchers focused their attention on variations
of the multi-armed bandit (MAB) problem where specific temporal
dependencies are imposed between the player's actions and the reward
distributions. In this direction, Kleinberg and Immorlica [FOCS'18]
consider the setting of "recharging bandits", where the mean reward of
each arm is a concave and weakly increasing function of the time passed
since its last play. In a similar spirit, Basu et al. [NIPS'19] consider the
problem of "blocking bandits", where, once an arm is played, it cannot
be played again for a given number of consecutive rounds.
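To make the blocking mechanics concrete, the following is a minimal
simulation sketch in Python; the arm means, the per-arm delays, and the
greedy selection rule are illustrative assumptions and do not come from
the cited papers.

    # Illustrative blocking-bandit simulation (hypothetical parameters).
    import random

    means = [0.9, 0.6, 0.3]    # assumed Bernoulli mean reward of each arm
    delays = [3, 2, 1]         # after a play, arm i is blocked for delays[i] rounds
    available_at = [0, 0, 0]   # first round at which each arm can be played again

    total_reward = 0.0
    for t in range(100):
        playable = [i for i in range(len(means)) if available_at[i] <= t]
        if not playable:
            continue  # every arm is blocked: the round yields no reward
        i = max(playable, key=lambda a: means[a])   # myopic greedy choice
        total_reward += random.random() < means[i]  # Bernoulli reward draw
        available_at[i] = t + delays[i] + 1         # block arm i
    print(total_reward)

Even in this toy version, myopically playing the best available arm need
not be optimal, since blocking a high-reward arm constrains the choices
in subsequent rounds.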
We consider a novel variant of the blocking bandits problem where, at
each time step, the player observes an independently sampled context
that determines the arms' mean rewards. Assuming prior knowledge of
the (context-dependent) reward distributions, we derive an efficient policy,
based on randomized rounding, that achieves an optimal competitive guarantee.
When the mean rewards are initially unknown, existing techniques for upper
bounding the regret fail due to the temporal correlations caused by blocking.
To prove our regret bounds, we introduce the novel concepts of delayed
exploitation and opportunistic subsampling, and combine them with ideas
from combinatorial bandits and non-homogeneous Markov chain coupling.
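As a companion sketch, the snippet below simulates the contextual variant
under simplifying assumptions: two contexts drawn i.i.d. each round and
context-dependent Bernoulli rewards. The myopic context-greedy rule here
is a stand-in for the LP-based randomized-rounding policy of the talk,
which is not reproduced.

    # Illustrative contextual blocking-bandit simulation (hypothetical parameters).
    import random

    contexts = ["A", "B"]
    ctx_weights = [0.5, 0.5]                    # i.i.d. context distribution
    means = {"A": [0.9, 0.2], "B": [0.1, 0.8]}  # assumed per-context arm means
    delays = [2, 2]
    available_at = [0, 0]

    total_reward = 0.0
    for t in range(100):
        c = random.choices(contexts, weights=ctx_weights)[0]  # context revealed
        playable = [i for i in range(len(delays)) if available_at[i] <= t]
        if not playable:
            continue
        i = max(playable, key=lambda a: means[c][a])  # context-greedy stand-in
        total_reward += random.random() < means[c][i]
        available_at[i] = t + delays[i] + 1           # block arm i
    print(total_reward)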
This is joint work with S. Basu, C. Caramanis, and S. Shakkottai.